Red Wine Insights

Here, we’ll take a look at some of the chemical features that make these wines more or less enjoyable.

Data from a 2009 Vinho Verde dataset from Portugal:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Loading data and taking a look:

For some reason the file uses semicolons as separators, so we’ll have to override the default comma separator:

data <- read.csv("winequality-red.csv", sep = ";")
head(data)
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5
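Semicolon-separated files are common in locales that use a comma as the decimal mark. Here is a minimal sketch of two ways to read one, using a small inline string (`sample_text`, made up for illustration and not copied from the file) so it runs without the CSV on disk:

```r
# Inline stand-in for a semicolon-separated file (illustrative values only)
sample_text <- "fixed.acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n11.2;3.16;6"

# Option 1: read.csv() with the separator overridden, as done above
wines_a <- read.csv(text = sample_text, sep = ";")

# Option 2: read.csv2() already defaults to sep = ";", but it also assumes
# dec = ",", so the decimal mark has to be set back to "." for this data
wines_b <- read.csv2(text = sample_text, dec = ".")

str(wines_a)
```

Both calls should give the same data frame here; the first is arguably clearer because it changes only the one thing that differs from a standard CSV.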

Analyzing data that we don’t understand isn’t the best idea, so let’s establish what these mean:

  • Fixed acidity: this gives the wine its structure and ‘zing’. Too much and it’s sour; too little and it tastes flat and flabby
  • Volatile acidity: while this comes from acid, it’s something you sooner smell than taste. Volatile acids become gases near room temperature and so we wouldn’t expect them to add much to enjoyment
  • Citric acid: adds to overall acidity, usually used as an additive to improve taste and make the wine more robust to other flavours
  • Residual sugar: wine is alcoholic because sugar ferments into alcohol. Any sugar left behind in this process is residual, and thus gives the wine a sweet taste
  • Chlorides: these are salts. Their level often has a lot to do with the kind of soil at the vineyard
  • Sulfur: a little reading online says that this is often added to help prevent bacterial growth, though I don’t know much about this part
  • Quality: subjective rating by tasters between 0 and 10

Now it would be good to check out some summary statistics and distributions for these attributes.

summary(data)
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

Thankfully no NA values.
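`summary()` only prints an `NA's` row for columns that contain missing values, so the absence of that row is what tells us the data is complete. A more explicit check could look like the sketch below; the `toy` data frame is invented for illustration and has one deliberately missing value:

```r
# Toy data frame with one deliberately missing value (illustrative only)
toy <- data.frame(pH = c(3.51, NA, 3.26), alcohol = c(9.4, 9.8, 9.8))

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# TRUE if there is at least one missing value anywhere in the data frame
has_missing <- any(is.na(toy))
print(has_missing)
```

On the wine data, the equivalent call would be `colSums(is.na(data))`.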

Going on to look at the structure of the data:

str(data)
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

This all looks clean and well standardized. Time to look at distributions and outliers.

hist(data$fixed.acidity)

hist(data$volatile.acidity)

hist(data$citric.acid)

hist(data$residual.sugar)

hist(data$chlorides)

hist(data$free.sulfur.dioxide)

hist(data$total.sulfur.dioxide)

hist(data$density)

hist(data$pH)

hist(data$sulphates)

hist(data$alcohol)

hist(data$quality)

Most of these are positively skewed, and the “quality” ratings show the tasters generally thought most of these wines to be mediocre.
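Skew can also be put into a number rather than read off the histograms. The sketch below defines a simple moment-based `skewness()` helper (my own helper, not a base R function) and tries it on two made-up vectors:

```r
# Moment-based sample skewness: positive means a longer right tail
skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Illustrative vectors (not taken from the wine data)
right_tailed <- c(1.9, 2.0, 2.2, 2.3, 2.6, 6.1, 15.5)
symmetric    <- c(3.1, 3.2, 3.3, 3.4, 3.5)

skewness(right_tailed)  # positive: long right tail
skewness(symmetric)     # approximately zero

# On the real data this could be applied per column:
# sapply(data, skewness)
```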

Looking for outliers:

boxplot(data$fixed.acidity)

boxplot(data$volatile.acidity)

boxplot(data$citric.acid)

boxplot(data$residual.sugar)

boxplot(data$chlorides)

boxplot(data$free.sulfur.dioxide)

boxplot(data$total.sulfur.dioxide)

boxplot(data$density)

boxplot(data$pH)

boxplot(data$sulphates)

boxplot(data$alcohol)

boxplot(data$quality)

Interesting. The positive skew is coming out here, so we have lots of outliers. Since this dataset has so many, I won’t do anything about the outliers just yet. Maybe what’s happening at the extremes has the coolest insights.
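To count the outliers rather than eyeball them, `boxplot.stats()` returns the same points that `boxplot()` draws beyond the whiskers (by default, more than 1.5 times the box length past the hinges). A minimal sketch on a made-up vector:

```r
# Illustrative vector with one extreme value (not from the wine data)
sugar <- c(1.8, 1.9, 1.9, 2.0, 2.2, 2.3, 2.6, 15.5)

# boxplot.stats() returns the points beyond the whiskers in $out
flagged <- boxplot.stats(sugar)$out
print(flagged)

# On the real data, a per-column count would be:
# sapply(data, function(x) length(boxplot.stats(x)$out))
```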

Heatmap to show correlations

Now that we have an idea of how our dataset looks, we can check out the correlations between the different attributes and see what jumps out:

library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.0.5
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.0.5
# Correlation matrix
corr <- round(cor(data), 1)

# p-value significance of correlation matrix
ps <- cor_pmat(data)

plot <- ggcorrplot(corr, hc.order = TRUE, type = "lower", outline.col = "white",
    p.mat = ps)
ggplotly(plot)
## Warning in L$marker$color[idx] <- aes2plotly(data, params, "fill")[idx]: number
## of items to replace is not a multiple of replacement length

Chemistry says some of these correlations make a lot of sense:

  • pH and acidity are strongly negatively correlated, since lower pH values mean greater acidity
  • citric acid increases fixed acidity, so they move together
  • people like alcohol, so that’s what was most strongly correlated with quality :)
  • alcohol is less dense than water, so more alcohol generally means a lower density; dissolved fixed acids work the other way and raise it
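The heatmap rounds every coefficient to one decimal place, so for any single pair it can be worth looking at the unrounded value and its p-value with `cor.test()`. The sketch below uses simulated numbers (not the wine measurements) in which pH is constructed to fall as acidity rises:

```r
set.seed(42)  # reproducible simulated data, not the wine measurements

# Simulate acidity and a pH that falls as acidity rises, plus noise
acidity <- rnorm(50, mean = 8, sd = 1.5)
ph      <- 3.8 - 0.06 * acidity + rnorm(50, mean = 0, sd = 0.05)

result <- cor.test(acidity, ph)  # Pearson correlation by default
result$estimate  # correlation coefficient r
result$p.value   # p-value for the null hypothesis r = 0

# On the real data: cor.test(data$fixed.acidity, data$pH)
```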

What makes an amazing red wine?

If we take the superb wines (8 or above on the scale), the mediocre wines (4 to 7) and the dreadful ones (3 or below), we might see some differences that separate the good, the bad and the ugly.

data$category <- ifelse(data$quality >= 8, "superb", ifelse(data$quality <=
    3, "dreadful", "mediocre"))
ggplotly(ggplot(data, aes(x = category, fill = category)) + geom_bar())

Nearly all the wines are mediocre. Oh well.
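Nested `ifelse()` calls get hard to read as the number of bins grows. One alternative is `cut()`, sketched below on a made-up `quality` vector with the same thresholds as above:

```r
# Illustrative quality scores (not the full dataset)
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are closed on the right by default:
# (-Inf, 3] -> dreadful, (3, 7] -> mediocre, (7, Inf) -> superb
category <- cut(quality,
                breaks = c(-Inf, 3, 7, Inf),
                labels = c("dreadful", "mediocre", "superb"))

# Counts per category, the numbers behind a bar chart like the one above
table(category)
```

Unlike `ifelse()`, `cut()` returns a factor, so the categories keep a fixed order in tables and plots.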

Alcohol, fixed acid and quality

# Points take the colour aesthetic (the default point shape ignores fill)
ggplotly(ggplot(data, aes(x = fixed.acidity, y = alcohol, color = category,
    size = quality)) + geom_point())

Moderate to low fixed acidity and a high alcohol content seem to make the preferred wines.
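A scatter plot gives the impression; group means give numbers to back it up. The sketch below uses `aggregate()` on a small `toy` data frame whose values are invented for illustration, so the output says nothing about the real wines:

```r
# Toy rows standing in for the wine data (values are illustrative)
toy <- data.frame(
  category      = c("superb", "superb", "mediocre", "mediocre", "dreadful", "dreadful"),
  alcohol       = c(12.0, 13.0, 10.0, 10.4, 9.8, 10.0),
  fixed.acidity = c(8.0, 9.0, 8.2, 8.4, 8.0, 8.8)
)

# Mean alcohol and fixed acidity per category
group_means <- aggregate(cbind(alcohol, fixed.acidity) ~ category,
                         data = toy, FUN = mean)
print(group_means)
```

Swapping `toy` for `data` would give the same summary for the actual dataset.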

Sweet and salt

We don’t typically think of sweet and salt for wine, but let’s see if there’s any useful takeaway from that.

ggplotly(ggplot(data[data$category == "dreadful", ], aes(x = residual.sugar,
    y = chlorides)) + geom_point(color = "brown4") + geom_smooth(method = lm,
    color = "red"))
## `geom_smooth()` using formula 'y ~ x'
ggplotly(ggplot(data[data$category == "mediocre", ], aes(x = residual.sugar,
    y = chlorides)) + geom_point(color = "seagreen4") + geom_smooth(method = lm,
    color = "springgreen3"))
## `geom_smooth()` using formula 'y ~ x'
ggplotly(ggplot(data[data$category == "superb", ], aes(x = residual.sugar,
    y = chlorides)) + geom_point(color = "steelblue3") + geom_smooth(method = lm,
    color = "blue4"))
## `geom_smooth()` using formula 'y ~ x'

Now the scales of the axes and the sample sizes have a lot to do with what we see here. But the takeaway is: if you’re going to make a salty red wine, at least make it sweet enough and strong enough to cover the taste.
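Since sample size matters so much here, it helps to print the number of wines next to each fitted slope before reading much into a trend line. The following is a sketch on an invented `toy` data frame with two groups of different size:

```r
# Toy data: two groups of different size (values are illustrative)
toy <- data.frame(
  category       = c(rep("mediocre", 6), rep("superb", 3)),
  residual.sugar = c(1, 2, 3, 4, 5, 6, 1, 2, 3),
  chlorides      = c(0.08, 0.09, 0.07, 0.10, 0.08, 0.09, 0.05, 0.06, 0.07)
)

# For each category, report the sample size and the fitted slope
fit_summary <- sapply(split(toy, toy$category), function(group) {
  fit <- lm(chlorides ~ residual.sugar, data = group)
  c(n = nrow(group), slope = unname(coef(fit)["residual.sugar"]))
})
print(fit_summary)
```

A slope fitted through only a handful of points, as in the smallest categories, can swing a lot with a single wine.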

Conclusion: people seem to like wines with lots of alcohol, moderate acidity and not a lot of salt, or maybe with just enough sugar to cover it.